肿瘤配对样本用varscan 做cnv分析
VarScan v2.2.4 以及更新的版本里面添加了这个分析,主要是针对matched tumor-normal pairs的WES数据来进行分析,得到的是肿瘤测序数据相当于其正常组织测序结果的拷贝数变化情况。输出文件是经典的 chromosome, start, stop, and log-base-2 of the copy number change
与拷贝数芯片输出结果类似,所以这个结果可以被bioconductor的DNAcopy包来进行segment算法处理。
详细用法见:http://varscan.sourceforge.net/copy-number-calling.html
输入数据,bam文件
这里选取的是前面博文 ESCC-肿瘤空间异质性探究 | 生信菜鸟团 提到的数据,如下:
-r--r--r-- 1 jianmingzeng jianmingzeng 17G Sep 22 00:01 ESCC13-N_recal.bam
-r--r--r-- 1 jianmingzeng jianmingzeng 22G Sep 22 02:09 ESCC13-T1_recal.bam
-r--r--r-- 1 jianmingzeng jianmingzeng 21G Sep 22 01:59 ESCC13-T2_recal.bam
-r--r--r-- 1 jianmingzeng jianmingzeng 16G Sep 21 23:29 ESCC13-T3_recal.bam
-r--r--r-- 1 jianmingzeng jianmingzeng 17G Sep 22 00:32 ESCC13-T4_recal.bam
`
是同一个病人的4个不同部位的肿瘤,共享同一个正常对照测序数据。
CNV流程
step1:得到raw copynumber calls
normal='/home/jianmingzeng/data/public/escc/bam/ESCC13-N_recal.bam'
GENOME='/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta'
for tumor in /home/jianmingzeng/data/public/escc/bam/*-T*.bam
do
start=$(date +%s.%N)
echo `date`
file=$(basename $tumor)
sample=${file%%.*}
echo calling $sample
samtools mpileup -q 1 -f $GENOME $normal $tumor |\
awk -F"\t" '$4 > 0 && $7 > 0' |\
java -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar copynumber - $sample --mpileup 1
echo `date`
dur=$(echo "$(date +%s.%N) - $start" | bc)
printf "Execution time for $sample : %.6f seconds" $dur
echo
done
不过该软件似乎是在这个地方有一个bugs,我也是查了biostar论坛才发现的。主要是两个样本的mpileup文件不能有0,所以需要过滤一下。
这个步骤会产生以copynumber为后缀的输出文件,内容如下:
chrom chr_start chr_stop num_positions normal_depth tumor_depth log2_ratio gc_content
1 10023 10122 100 14.5 11.1 -0.382 50.0
1 10123 10163 41 15.3 12.4 -0.306 53.7
1 10165 10175 11 12.2 11.5 -0.089 54.5
1 10180 10215 36 11.0 10.9 -0.015 50.0
1 12949 13048 100 17.5 17.0 -0.041 62.0
1 13049 13148 100 26.5 32.9 0.314 61.0
1 13149 13248 100 50.2 52.2 0.056 57.0
1 13249 13348 100 59.7 79.3 0.409 60.0
1 13349 13448 100 56.4 70.1 0.313 53.0
step2:对GC含量进行校正
众所周知,以illumina为代表的二代测序对不同GC含量的片段测序效率不同,需要进行校正。而上一个步骤记录了每个片段的GC含量,所以这个步骤就用算法校正一下。
java -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar copyCaller ESCC13-T1_recal.copynumber --output-file varScan.T1.cnv.called
似乎挑选标准比较简单,而且出来的结果太多了。
Reading input from ESCC13-T1_recal.copynumber
1959962 raw regions parsed
498738 met min depth
498738 met min size
206188 regions (20520229 bp) were called amplification (log2 > 0.25)
194622 regions (19384202 bp) were called neutral
97928 regions (9726174 bp) were called deletion (log2 <-0.25)
213 regions (20539 bp) were called homozygous deletion (normal cov >= 20 and tumor cov <= 5)
step3:导入DNAcopy包
rm(list=ls())
library(DNAcopy)
a=read.table('varScan/varScan.T3.cnv.called',stringsAsFactors = F,header = T)
head(a)
CNA.object <- CNA(cbind( a$adjusted_log_ratio),
a$chrom,a$chr_start,
data.type="logratio",sampleid="T3")
smoothed.CNA.object <- smooth.CNA(CNA.object)
segment.smoothed.CNA.object <- segment(smoothed.CNA.object, verbose=1)
sdundo.CNA.object <- segment(smoothed.CNA.object,
undo.splits="sdundo",
undo.SD=3,verbose=1)
head(sdundo.CNA.object$output)
dim(sdundo.CNA.object$output)
dim(sdundo.CNA.object$data)
dim(sdundo.CNA.object$segRows)
head(sdundo.CNA.object$segRows)
tail(sdundo.CNA.object$segRows)
save.image(file = 'DNAcopy_T3.Rdata')
近50万行的数据经过了segment算法得到了三千多个拷贝数变异区域,但是有不少区域非常短,其实是需要进一步过滤的。
> head(sdundo.CNA.object$output)
ID chrom loc.start loc.end num.mark seg.mean
1 T3 1 13049 1666302 1548 -0.2183
2 T3 1 1669546 1684191 11 -0.9250
3 T3 1 1684291 2704373 1037 -0.1705
4 T3 1 2705756 2705956 3 -1.7230
5 T3 1 2706056 6647084 1568 -0.2122
6 T3 1 6647184 6647584 5 0.6866
> dim(sdundo.CNA.object$output)
[1] 3079 6
> dim(sdundo.CNA.object$data)
[1] 499065 3
> dim(sdundo.CNA.object$segRows)
[1] 3079 2
> head(sdundo.CNA.object$segRows)
startRow endRow
1 1 1548
2 1549 1559
3 1560 2596
4 2597 2599
5 2600 4167
6 4168 4172
> tail(sdundo.CNA.object$segRows)
startRow endRow
3074 497115 498045
3075 498046 498048
3076 498049 498782
3077 498783 498785
3078 498786 498903
3079 498904 499065
找到拷贝数变异后,除了需要过滤,还需要进行注释,如果有多个样本,还需要在群体里面找寻找recurrent copy number alterations (RCNAs)
比如下面的文章:
We recently published Correlation Matrix Diagonal Segmentation (CMDS) to search for RCNAs using SNP array-based copy number data, and this program can be applied to exome-based copy number data as well. (这个一般用GISTIC咯)